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A NEW ESTIMATOR FOR THE NUMBER OF SPECIES 

IN A POPULATION 
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'q^', University of Firenze and Missouri University of Science and Technology 

^— , We consider the classic problem of estimating T, the total number 

' of species in a population, from repeated counts in a simple random 

sample and look first at the Chao-Lee estimator: we initially show 



(N 



^ , that such estimator can be obtained by reconciling two estimators of 



the unobserved probability, and then develop a sequence of improve- 
ments culminating in a Dirichlet prior Bayesian reinterpretation of 
. the estimation problem. By means of this, we obtain simultaneous 

estimates of T, the normalized interspecies variance 7^ and the pa- 
rameter A of the prior. Several simulations show that our estimation 
Qh ' method is more flexible than several known methods we used as com- 

' parison; the only limitation, apparently shared by all other methods, 

seems to be that it cannot deal with the rare cases in which 7^ > 1. 

'a 
» 
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1. Introduction. We consider the classic problem of estimating the 



> 

o . 

I number T of species in a population, and, subsequentely, their distribution, 

■ from a simple random sample drawn with replacement. We are interested 
in the "small sample" regime in which it is likely that not all species have 

I been observed. Problems of this kind arise in a variety of settings: for exam- 

QQ ' pie, when sampling fish from a lake or insects in a forest (see, for instance. 



■ Shen, Chao and Lin (2003) 47|] on how to use estimates of T to predict 

further sampling, or [7[); or when estimating the size of a particular popula- 
tion (see [^); or when trying to guess how many letters an alphabet or how 
many specific groups of words a language contains (see [3]) or how many 
words a writer knows (see |19i]): or, even, when determining how many dif- 
ferent coins were minted by an ancient population (Esty [2]|). Because of 
its great interest this has become a classic in probability, and there has been 
a great number of studies suggesting methods for the estimation of T. See, 
for instance, 0| for a review through 1993, ^] for some further details and 
Colwell's Estimates for software implementing a large number of estimators. 
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In particular, [8] calls for some development of the Bayesian method for the 
estimation of T, which is the direction that we eventually have taken. 

In this paper we start, in fact, by analyzing one well known estimator of 
T, namely the one by Chao and Lee ([l^). One of our results shows that the 
estimator can be obtained by reconciling two estimators of the unobserved 
probability U: one being an extended version of Laplace's "add A" ([34;]) 
and the other the estimator by Turing and Good ([2J])5 provided that the 
normalized interspecies variance 7^ is interpreted as the inverse of the A. 
Then we proceed by developing simultaneous methods for estimating T and 
A (or 7^, which is the same). 

By such methods we improve on the original Chao-Lee estimation, but 
the estimators we obtain are shown by simulations to have some serious 
defects. It is for this reason that we perform a more fundamental analysis of 
the problem by means of a Bayesian approach. This is based on a Dirichlet 



prior with parameter A on the probabilities of T species (see [33| , [3^ , [25|] , 
and [4^ for an historical description); the parameter turns out to be the 
same as the one in Laplace's method. The simultaneous estimation that we 
develop now takes into account a posterior second moment of the random 
species probabilities compared to the classical Good Toulmin estimator for 
the same quantity (see j27j|). 

Let us mention that the empirical Bayesian approach used here is different 
from that of existing results in the literature. The method in [4l| is, in fact, 
limited to uniform species distributions. On the other hand, the general 
Bayesian approach in Boender and Rinnoy Kan (1987) [3] starts from a 
prior distribution of T and, conditionally to T, a uniform or Dirichlet (A) 
prior on the species probability, but then introduces a (level III) prior on 
A itself (as suggested in (26| ) which in turn requires the introduction of a 
further parameter (Boender and Rinnoy Kan (1987) [3], formulae (10) and 
(11)), with then no analytical expression for the posteriors. In the end, this 
direction seems to include several undetermined choices (the prior on T and 
the extra parameter at level III) and no simple analytical expression of the 
estimators. 

At the end of the paper we present some numerical tests. Due to the 
inherent difficulty in finding fully published data for this estimation we resort 
to simulations and real tests on discovering the size of an alphabet. The tests 
seem to indicate that the new estimator of T is more flexibile than existing 
ones and thus preferable, in the sense that the performance of all estimators 
seem to greatly depend on the normalized variance 7^ , and the new estimator 
is the only one able to perform rather well for all values of 7^ E [0, 1]. In 
our method, the only constraint is that A > 1, which is 7^ < 1, which is 
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imposed in order to ensure convergence of the prior; this, in turn, imposes 
a mild hmitation on the populations to which the method can be applied, 
since 7^ can, for some peculiar population, exceed 1; on the other hand, such 
populations are likely to be quite unusual and, in addition, all other existing 
estimators seem also to fail on samples taken from them. 

In section 2 we review in detail some known estimation methods of interest 
in deriving our results; in section 3 we derive some relations between known 
estimators and our first improvements; in section 4 we develop the Bayesian 
method and define our final estimator; in section 5 we give estimates of the 
species probabilities from, for both the observed and the unobserved ones; 
from these, we indicate how to generate confidence intervals for T by means 
of resampling; finally, in section 6 we present some simulations revealing a 
rather good performance of our new estimator and also very adequate results 
of the confidence intervals. All detailed mathematical proof are deferred to 
the Appendix. 



2. Some known estimators of T and related quantities. We start 
with some notation. Assume that the population from which the sample 
is drawn has a total of T species (which we sometimes will call states) 
having proportions pi,P2,--- ,Pt-] and that in a sample xi,X2,-" i^n of 
size n there are observed species. For i = 1, - ■ ■ , T, let be the number 
of observations of the species i in the sample, so that X^i^i = ^- We 
assume that the mj's are given one of the possible orders in which rrii > 
1712 . . . ,mN > 1 and mj = for i = A^+1, . . . , T. Also, for j = 1, • • • , n, let rij 
be the prevalence of j, which is to say the number of species observed exactly 
j times, so that X]j=i "-i = ^- Next, let L„(z) = rrii/n be the empirical 
frequency of species i, so that C = J2i:Ln{i)>oPi coverage, i.e, the total 

probability of the observed species, and U = 1 — C = J2i:Ln{i)=oPi is the 
unobserved probability. We are interested in the estimation of T from the 
prevalences. 

The estimation of U has also been studied intensively (see, for instance, 
(40I and (s^). In fact, it is possible to turn the estimation of U into a 
simplified version of our original problem by assuming that there are + 1 
species, the A^ observed ones and the "new" species with probability U; the 
main issue becomes then the estimation of the probabilities of the various 
species and especially for the new one. For this and other reasons that we 
shall see, the estimations of T and U are closely intertwined (even the title 
of [iO] points to this relation). 

The first attempt to estimate U can be extracted from Laplace (see [34 1 
and [iH]) who suggested an "add-one" estimator: this consists in adding 



imsart-aos ver. 2007/12/10 file: CompletEstimatl4.tex date: April 8, 2008 



4 



CECCONI, GANDOLFI, SASTRI 



one to the number of observations of each species plus an additional one 
for the "unobserved" species. In an extended version, which can be named 
"add A", one can add some positive value A to each species' number of 
observations (including the unobserved one): an estimate of the probability 
of each observed species i is then pi = x+J\^\m +\) ~ n+{N+i)\ ^^"^ 
estimate of the unobserved probability becomes U^^x = T^qr^T^qrjy^- 



With a seemingly completely different method, Turing and Good (see [24 
proposed another estimator of U. Recall that ni is the number of species 
observed exactly once and n the size of the sample; then the Turing-Good 
estimator for U is some minor modification of: 

Utg = —■ 

n 

A plausible rationale for this estimator is that while for species observed 
at least twice the empirical frequency is already becoming stable and very 
likely close to the corresponding probability, species observed only once are 
likely to be randomly selected representatives of the collection of the yet 
unobserved species. A more sound mathematical derivation is in Good ([2J]), 
in which also a "'smoothing"' of the n^'s is proposed. 

Other methods to estimate U have been developed, and in particular we 
refer to [s^ for a Bayesian method based on the gen eral class of Gibbs-type 
priors (see also [i^ and the other references in [3^ for the definition and 
properties of such priors). This class contains several known families of priors 
as particular cases and each such family is based on one or more parame- 



ters, which need to be further estimated. In [38|], for instance, a maximum 
likelihood estimator is used. Another recent advance appears in Orlitsky et 
al ([iH]), in which a quantity is introduced, called attenuation, that mea- 
sures the effectiveness of the estimation of U as the sample gets larger; the 
performance of an estimator is compared to the maximum probability of 
the observed prevalences and asymptotically very good estimators are de- 
termined. 

We are going to base our work here on a preliminary estimation of U . It 
is conceivable that within the wide class of proposed estimators of U some 
would improve the results that we get; however, we focus on the unsmoothed 
Turing-Good estimator since it is more direct and simple, while still allowing 
us to achieve very satisfactory results. 

Getting back to the estimation of T, there are several parametric methods 
based on assuming some structure of the species distribution; for instance, 
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an estimator devised for the uniform case, in which the probabihties of all 
species are assumed to be the same is the Horvitz-Thompson 

Tht 



(see [3^ and Bishop, Fienberg and Holland (1975) [3]) and then U can be 
further estimated, for instance by the unsmoothed Turing-Good method, to 
get 

(1) Thttg 



1 _ Uj,^ n-rii 
16 1 and [B]- Esty 2C] improves this estimate by assuming a negative 



see 

binomial prior with parameter k to get 

- N uUtg 1 

then providing some ad hoc guess for k (in some cases, k = 2). 

As to nonpar ametric methods, Harris [2^ . Chao 12] and Chao & Lee [l3| 
have proposed some such estimators, of which the most reliable ones seem 
to be those proposed in [3]. In our notation these amount to 

with 7^ an estimate - for which Chao & Lee make two proposals - of the 
normalized variation coefficient of the pj's. In fact, assume that p is a random 
variable uniformly distributed on the T population probabilities pi, . . . ,pT], 
then its average is 

1 ^ 1 

k=i 

and its normalized variation coefficient is 

Next, Chao and Lee proceed by using an estimate of Good and Toulmin 
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and using one preliminary estimate for T, ([T]) for instance, to obtain 

-2 f nN ^ j{j - l)nj 
7 =max( l^—f 

Note that the work by Chao and Lee can be considered as a further im- 
provement over the results by Esty. However, Chao and Lee make a rather 
direct use of a preliminary guess for T and we think their method is too 
sensitive to errors in such preliminary evaluation. In the next section we 
start discussing some possible improvements. 



3. Preliminary results on new estimators. (I) We first consider ^ 
and ^ as equations in the unknowns T and 7^ and search for simultaneous 
solutions T > N and 7^ > 0. Since in some simple examples the unique 
solution gives 7^ < 0, we consider the solutions 71(7^) and 71 of the problem 

(6) T = T(7^) = + 

1-Utg {1-Utg) 

(7) f = arg inf 7^ - (TVbr - 1) 

7 >o 

with Vgt as in ([5]). On letting u = Utg and v = Vgt for brevity, the function 
to minimize becomes 

(1 — n + nuv)^'^ + 1 — n — Nv; 
note that (1 — n + nuv) > since n < 1, so that the solutions of © are 

^2 f if < u < 1 - iVu 

7i = < Nv-i+u ^ nYgt-i+Utg if I - Nv < u 

[ l-u+nuv i_u^^^nUTGVGT 

andfi = ri(72). 

Some tests described in section 6 show that Ti performs better for non 
uniform populations than the original Chao-Lee estimate, but has too large 
a variance. 

(II) Next we compare two estimators of C/, the unsmoothed Turing-Good 
and the following modified version of the "add A": assume the number T of 
species is known and add A to each of the frequencies of all the T species. 
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not just to that of those arbitrarily labelled through + 1. This would give 

- f\\ rrik + X 

vuIa] = — Tpei k = 1 . . . I\ 

Pfc(A) = jf,^ peik = N + l...T 
TX + n 



Ux 



{T-N)X 



TX + n 

since there are T — N unobserved species. Now, we can hope to reconcile the 
extended "add A" and the unsmoothed Turing-Good estimators by requiring 
that they assign the same value to U. This amounts to solving 

Solving for T we get 

(9) Tx = ^ = n . 

1 - Utg n-ni 

Quite surprisingly, we have obtained 

Lemma 3.1. The only value ofT for which the extended '"add X'" and 
the Turing-Good estimators ofU coincide, is the Chao-Lee estimator Tc 
with 7^ = 1/A. From now on we will assume this equality and mostly refer 
to the parameter X. 

(Ill) The relation found in (II) suggests that ([6]) can be seen as a first 
moment estimate: 

T 

(10) PkW = UTG, 

k=N+l 

SO that one can hope to derive 7^ from a second moment relation. The form 
is suggested by (I), considering the meaning of Vgt- 

T 

(11) A2 = arg inf \Y.Pk{Xf - Vgt\. 

- k=i 

The solutions f2(A2) and A2 of and ([II]), together with 7^ = X2^, 
give new estimators; although this seems to improve the estimation in some 
cases, it does appear to have significant flaws, as shown in the simulations 
reported in tables 1-3. 



imsart-aos ver. 2007/12/10 file: CompletEstimatl4.tex date: April 8, 2008 



8 



CECCONI, GANDOLFI, SASTRI 



4. The Bayesian interpretation. To further improve the above es- 
timate, we need to understand more about the "add A" estimator. It turns 
out, as was probably known already to Laplace, that the probability esti- 
mation according to the "add A" method is nothing but the average species 
probability under the Bayesian posterior on probability distributions on T 
species 

T 

= {P= {Pl,P2, ■ ■ ■ ,PT),Pi > 0,^Pi = 1}, 

i=l 

given the sample, with a single parameter Dirichlet prior /9o,t,A) i-e- a prior 
with density cH^iP^"^ for some constant c and A > 1. With likelihood 

n T 



j=l i=l 



the posterior becomes 
(12) Pn,T,\idn) = 



1 ^ 

= Pn,T,\{dfl) = i^^^EtYIpT''^^ ^dpi . . . dpT- 

where Z = J-^^ p^^^'^~^ ■ ■ ■ Pjv^^'^^^?'am-\ ' ' ' Pj^^dpi ■ ■ ■ dpx (note that the 
constant terms have been cancelled). 

By standard integration using the gamma function (see Appendix 1), we 
find that the average species probability under the posterior is: 

/ ^ f ^TJ^ ifi = l,...,iV 

I TXT^ ifi = N + l,...,T 

as claimed. This remark, together with our reconcilation Lemma in (I) above, 
indicates that we are taking a new step in the development which brought 
us from ([1]) to ([2]) and then to ^ by assigning now two other meanings for 
A~^ = 7^, namely that of the add constant in a generalized Laplace method 
and that of the constant in a Dirichlet prior. 

The Bayesian interpretation of pk also suggests a modification of the sec- 
ond moment minimization (llip . Recalling that now A > 1 we have: 

A = arginf |/(A)| 
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with 



/(A) = V-j2{E,^^,Jpl)) 

k=l 

n—ni 



Ei>o f'^j -n 2nA + A + nA(A + 1) 



n(n - 1) \n^^ + 11 [n^^^l 

where Tx has been taken as in ([9]) and the calculation is carried out in 
Appendix 1. In Appendix 2 we show the function /(A) has two singularities 
(32 < Pi = —jf < and two zero's, the interesting one being 

, , 1 — u — V + uv — uvn 

^^"^f -^2 = 77 — : --. • 

At; + u — 1 

The minimization depends on the sign of /(A) for large A which in turn 
depends on the sign of (A2 — Since /(A) is increasing for A > 1, if the 
limit for large A is negative, then the only reasonable value we can assign 
is 00, else there is a real solution for the minimization problem above: note 
that if A2 < 1 then we are forced to take A = 1. It is thus shown in Appendix 
2 that the minimization above yields the estimator 

1 if /3i < A2 and 1 > A2, i.e. ^^^g^ < ^ < 1 " 
A2 if (3i < A2 and A2 > 1, i.e. 1 - Nv < u < ^^^5^ 
00 if A2 < i.e. < u < 1 - Au. 

From do]) we get the following estimator of T: 

^ _ N + nUTG/X _i n^±^^l^ if /3i < A2 
1-Utg ifA2</3i. 

or, alternatively, 

r ^ if <u<i-v 

l—u 2—v+vn — — 

r = <^ N-Nv-nu ill- NV<U< ^~-(^+^) 
A I 1—u—v+uv—uvn — — 2—v+vn 



N 
1-u 



if < u < 1 - Aw. 



Clearly, 7^2 is not necessarely an integer while T is such, and we round 
it to the nearest integer. Notice that when A = 00 we get T-^ = Thttg- 
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5. Estimate of species distribution and confidence intervals for 

T. Since we now have an estimate for both the parameters T and A, we 
can use the posterior average probabihty of each species as an estimate of 
the species probabihties. For the observed species, i.e. for i = 1, . . . , N, this 
amounts to 

(14) p. = E, . 4m) = ^ = + 

^ ' "n.T^.A^^*^ r^A + n n + iVA 

This expression is correct also for A = oo in which case all species are es- 
timated to have probability (T)~^. Also note that these values are close to 
the unbiased estimator rm/n of the probability of the i-ih species and can 
be seen as a mixture of the Laplace add-A and Turing-Good estimators since 
they are obtained by adding A to the frequency mj of the N observed species 
(recall that n = Y^f=imi)^ but only after having assigned the probability 
U to the event that we will observe a new species; the estimate of each of 
the species is then reduced by the factor 1 — [/ to compensate for this 

and, in fact, iT-^ — N) ^^^zF^^ = U. This is likely to be a sensible way to 



make the attenuation of the Laplace estimator (see [45|) finite. An alterna- 
tive description of our estimator is then completed by using the previously 
estimated value of A. 

A simple approach for the unobserved species would be to uniformly split 
the probability U among the Tr^—N unobserved species and by the reconcila- 

A 

tion method in (IHl) and Q this would give — = — = . On the 

^ ^ ^ T^-N T^X+n n+NX 

other hand, notice that, since one can read ([10]) as ^ — J2k=iPki^) = ^ — Utg, 
the reconciliation method never used the moments of the pj's for i > N; 
therefore, we have some freedom in assigning the estimated values of the 
p'j^s for i > N. These values can then be estimated by taking into account 
the meaning of A~^ = 7^ as normalized species variance, or of some related 
quantities; we could then assign probabilities to the unobserved species to 
achieve the estimated normalized variance 7^ or to achieve some related 
equality. For simplicity we will actually focus on J2k=i pI ^^'^ its estimator 
V. This is a valid approach except when u < 1 — Nv, in which case /(A) < 
and V turns out to be too small to be a reasonable estimate of J2k=iPk'' 

that case we replace V with J2k=i^p (Pk)- Clearly 

n, ^, 

N N 
k=l k=l 
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by Jensen's inequality, and thus we require that the estimates pk of the 
probabihties of the unobserved species satisfy: 



A / A 



N 



k=N+l \ k=l ' A / k=l ' A 

We can use any two parameter distribution, such as for instance pi = 
ca^~^ for i = N + 1, . . . , T-^, and insist that 

(15) E P^ = UTG 

i=N+l 

and 

(16) E p' = ^- 

i=N+l 

Solving for c and a gives the estimated unobserved probabilities pi = Pi{c, a), 
which are used in the simulations of section 6 below to generate confidence 
intervals by resampling. 

It is easily seen that if T >> N then 

a(l — a) u/v 

and 

u(l — a) 

c « . 

a 



6. Simulations. In this section we present numerical simulations and 
tests of the performance of several estimators compared to those we have 
developed here. Tables 1-4 present the analysis of several populations in- 
creasing values of 7^. Tables 5-6 present some real tests based on discovering 
the number of letters in an alphabet from a long text. In table 7 we compute 
confidence intervals using a resampling based on the reconstructed species' 
probabilities as described in section 5 above. 

The estimators compared in tables 1-6 are Ti, T2 and defined here, 

then Tthttg from Tql from the Jackknife estimator with optimal 
parameter Tjk from [9|] (// indicates numerical errors due to small denom- 
inators), and T+i which is our (or the Chao-Lee) estimator with 7^ = 1. 
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In tables 1-4 each population is generated from T i.i.d. random variables, 
normalized to sum to 1; the resulting 7^ is determined as normalized inter- 
species variance; 1000 simple random samples of size n are then generated; 
finally, mean, SD and mean square error are computed for each estimator. 

Tables 5 and 6 test the letter content of some passages in English and 
Italian in order to detect the number of letters in each alphabet. Each table 
shows the results of taking 1000 samples of about 9000 letters each from the 
indicated texts. 

The conclusion that can be drawn from these tests is that estimator per- 
formances are seen to depend on 7^, with the 7^ presenting a consistent low 

value of the MSE as long as 7^ G [0, !]• Therefore, has the flexibility to 
adapt to the different values of the interspecies variance. In table 1, in fact, 
7^ ~ and the best estimators turn out to be Tthttg and Tql (in which 
clearly 7^ gets appropriately estimated), but all the estimators defined in 
the present paper perfom equally well. In the less uniform population in 
table 2, Jackknife and show the best performances; and in table 3 where 
7^ 1, the best estimator turns out to be T+i, while has only a slightly 

worse performance. Note that Ti and T2 show a very poor performance in 
table 2 and 3. 

Finally, table 4 shows an extremely skewed population, with 7^ very large, 
for which no estimator works properly. The reason for is that convergence 
of the prior imposes 7"^ = A > 1. 

Even in the alphabet test the performance of turns out to be overall 
best. 

Table 7 shows some simulations about confidence intervals for T based 
on samples of size n = 400 computed from T-^ by estimating the species 
probabilities pk as described in section 5 and then resampling 1000 times 
from the estimated population distribution. This process is repeated 100 
times and table 7 indicates, for the populations of tables 1-3 respectively, the 
percentage of times the confidence intervals hits the true value of T = 1000 
and the average size of the confidence interval. 

The hitting percentage comes out remarkably well, due to the good ap- 
proximation of the true population distribution by the estimated one. 
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T = 1000 


n = 500 


n 


= 1000 


n 


= 2000 


mean 


std 


MSE 


mean 


std 


MSE 


mean 


std 


MSE 


Ttg 


994 


79 


79 


999 


36 


36 


997 


16 


16 


TcL 


1010 


86 


87 


1009 


42 


43 


1000 


18 


18 


fjK 


1068 


96 


117 


1223 


84 


239 


1117 


165 


203 


f+1 


1759 


157 


775 


1580 


73 


585 


1309 


32 


311 


fl 


1003 


82 


82 


1005 


39 


40 


1000 


18 


18 


f2 


1017 


83 


86 


1010 


38 


40 


1027 


54 


60 


Tx 


1087 


193 


212 


1026 


60 


66 


1002 


20 


20 



Table 1 

Uniform population: pi 's ~ N{0, 1), 7^ « 0.009. 



T = 1000 


n = 500 


n 


= 1000 


n 


= 2000 


mean 


std 


MSE 


moan 


std 


MSE 


mean 


std 


MSE 


Ttg 


781 


59 


226 


816 


30 


186 


858 


15 


142 


TcL 


808 


72 


205 


847 


40 


158 


893 


20 


109 


fjK 


962 


88 


96 


1034 


88 


94 


1054 


763 


764 


T+i 


1342 


116 


361 


1245 


59 


252 


1118 


30 


122 


Ti 


796 


64 


213 


835 


35 


168 


884 


19 


117 


T2 


787 


59 


220 


816 


30 


186 


858 


15 


142 


Tx 


915 


189 


207 


891 


65 


127 


912 


26 


92 



Table 2 

Less uniform population: pi's ~ f7[0, 1], 7^ w 0.3317. 



7. APPENDIX 1: The Bayesian approach. By definition of the 
gamma and beta functions V(x) = J^°° e~H^~^dt x > and 



Jo 



r(i + v) ' 

taking z = y/{l — x) we get 

f~\%l-x-yfdy = j\l-xr'^'z<^{l-zfdz = (1-^)^+^ r(a + l)r(6 + 1) _ 
JO Jo I [a + + z) 

Next, let Pn,T,x be the Bayesian posterior, given a sample with species records 
nil, ■ ■ ■ ,'>T^N, from a Dirichlet prior with parameter A on 

T-l 

Qt = {p= {Pi • ■■PT-i) : Pfc > 0, ^ pfc < 1}. 

fe=i 
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T = 1000 


n = 500 


n 


= 1000 


n 


= 2000 


mean 


std 


MSE 


moan 


std 


MSE 


mean 


std 


MSE 


Ttg 


620 


43 


382 


659 


24 


341 


759 


16 


241 


TcL 


690 


65 


316 


784 


46 


220 


888 


30 


115 


TjK 


870 


119 


176 


955 


432 


435 


1027 


661 


662 


T+1 


1036 


85 


92 


990 


47 


48 


1013 


30 


30 


fi 


658 


52 


345 


733 


33 


268 


845 


23 


156 


fa 


620 


43 


382 


659 


24 


341 


759 


16 


241 


Tx 


910 


164 


187 


973 


66 


71 


1001 


42 


42 



Table 3 

Non-uniform population: pi 's ~ Exp{l), 7^ m 0.9992. 



T = 1000 


n = 500 


n 


= 1000 


n 


= 2000 




mean 


std 


MSE 


mean 


std 


MSE 


mean 


std 


MSE 


Ttg 


192 


10 


808 


228 


8 


772 


261 


7 


738 


TcL 


262 


26 


737 


346 


28 


654 


416 


29 


583 


fjK 


326 


486 


830 


// 


// 


// 


// 


// 


// 


T+i 


271 


19 


729 


304 


15 


696 


334 


14 


666 


Ti 


231 


16 


768 


291 


14 


708 


344 


14 


656 




192 


10 


808 


228 


8 


772 


261 


7 


738 




271 


19 


729 


304 


15 


696 


334 


14 


666 



Table 4 

Extremely skewed population: pi 's ~ r(l, 1), 7^ « 9.1289. 



Note that Pn,T,x is invariant under permutation of the Pfc's, so it is vahd to 
express any result via a permutation of indices from a proven statement. 
Therefore, in the following Theorems it is sufficient to prove the results for 
some index i. 

Theorem 7.1. For evey A > 1 and for every i = 1 . . .T, 
(17) = =i±^. 

Proof. For i e {1 . . .T - 1} we have: 

^Pn,T,xiPi) = / PiPn,T,x{dn) 

JQt 

Iq^ pt^^-' ■ ■ ■ yr+^ ... (1 - PI - ... - pT_,rT+x-idp, . . . dpT-i 

Iq^ PT^^-' ■ ■ ■ pT^^'^ ... (1 - pi - ... - py_i)-T+A-lrfp^ . . . dpT-l ' 
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T = 26 


n = 15 


n = 25 


n = 50 


media 


std 


MSE 


media 


std 


MSE 


media 


std 


MSE 


Ttg 


19.8 


10.3 


70 


18.6 


3.6 


56 


19.5 


2.5 


43 


TcL 


22.7 


13.3 


71 


20.7 


5.6 


56 


21.6 


4.0 


37 


TjK 


19.4 


6.9 


60 


23.0 


9.8 


58 


// 


// 


// 


f+1 


33.3 


20.3 


109 


27.0 


7.0 


56 


25.6 


4.8 


29 




26.9 


15.8 


87 


23.2 


7.5 


60 


23.4 


5.3 


36 



Table 5 

Estimates for the 26 letters English alphabet from samples drawn from fl^; 7^ ^ 0.7029 

(see mi) 



r = 21 


n = 15 


n = 25 


n = 50 




media 


std 


MSE 


media 


std 


MSE 


media 


std 


MSE 


Ttg 


16.0 


6.9 


62 


15.7 


3.4 


43 


16.7 


2.0 


31 


TcL 


18.4 


9.8 


71 


17.7 


5.4 


45 


18.5 


3.3 


27 


TjK 


16.9 


6.3 


51 


19.6 


8.2 


65 


// 


// 


// 


f+i 


26.3 


14.3 


113 


23.1 


6.8 


48 


21.5 


4.0 


27 


Tx 


21.5 


12.2 


90 


19.9 


7.3 


49 


19.8 


4.3 


29 



Table 6 

Estimates for the 21 letters Italian alphabet from samples drawn from f^J; 7^ « 0.5932 

(see 0/; 



For A; = 1 ... T, let 

Sk = mfc + A - 1 
Sk = Sk + 6ik,i) 

where 5 is the Kronecker delta and for A; = 1 ... T — 1 let 

i{k) = f pl'...pl'{i-pi-...- pkY^^-+'^+^^^~^-^dpi ...dpk 

and 

Q,^. ^ T{sk + l)T{sT + ■■■ + Sk+i + T -k) 
T{sT + ... + Sk+i + Sk + T -k + l) 

and let I{k) and G{k) be as the quantities without hat but with Sk replacing 
Sk, so that 

/(T-l) 



/(T-l) 
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Confidence level 


Population — > 


1 


2 


3 


90% 


fraction of hits 
average interval size 


93% 
1115 


92% 
821 


80% 
707 


95% 


fraction of hits 

average interval size 


95% 
1225 


98% 

889 


89% 
827 


99% 


fraction of hits 
average interval size 


97% 
1520 


100% 
1064 


98% 
977 



Table 7 
"mances at 
resampling. 



Summary of confidence interval performances at the given confidence level from T-^ by 



Now we have 



r{ST + ST-l + 2) 

= C(r_i)r(^T-. + i>r(''T + .T- + 2) 

^ ' r(sT + ST-l + ST-2 + 3) 



and 

/(T- 1 



r(sT + ... + SI + T) 



r(sT + 1) . . . r(si + 1) r{sT + 1) . . . T{si + 2) . . . t{s^ + 1) 



r{sT + ... + SI+T) T{ST + ... + S1+T+I) 

Therefore, 

. , r(., + 2)r(.,- + ... + .si + r) _ ., + 1 

T{si + l)T{sT + ... + si + T + l) ST + ...SI+T 

mj + A rrii + X 

mi + . . . + niT + TX TX + n 

It is easily verified that X!fe=i xx+n ~ ^' 

Moreover, adding these values over the T — N unobserved species we get 
an estimate of U: 

mi=0 i=N+l lA-tri 

□ 



imsart-aos ver. 2007/12/10 file: CompletEstimatl4.tex date: April 8, 2008 



ESTIMATOR OF THE NUMBER OF SPECIES 



17 



Lemma 7.1. For every A > 1 and i,j = 1 . . . T such that i ^ j, 
(18) ^p^,t,MPi) 



{rrii + \){mj + A) 



(rA + n + l)(TA + n) 
Proof. Following the proof of Theorem 17.11 let, for k = 1 . . .T, 

Sk = mk + X-1 

Sk = Sk + 5{i,k) + 5{j,k), iy^j, l<i,j<T-l 

Thus 

/(T-l) 



Jqt n-^ - iJ 

where 

r(sT + i)...r(si + i) 



i(r-i) = 
7(r-i) = 

Therefore 

^Pn,TAPlPj) 



T{sT + ■.. + SI+T) 

TjsT + 1) . ■ ■ Tjsi + 2)... rjsj + 2) . . . r(si + 1) 
T{sT + ■.. + SI + T + 2) 

Tjsi + 2)T{sj + 2)T{sT + ... + SI + T) 
T{si + l)T{sj + l)r(sT + ... + S1 + T + 2) 

{Sj + l){Sj + 1) 

{Esk + T){Esk + T + l) 
_ {rrii + X){mj + X) 
(TA + n)(rA + n + l) 

□ 



Lemma 7.2. For every A > 1 and for every k = 1 . . .T, 

(19) ^Pr.,Tjpl) 



2. _ (mfc + A + l)(mfc + A) 



(rA + n + l)(rA + n) 
Proof. As in Theorem \7A\ for A; = 1 ... T and i G {1 ... T - 1} let 

Sk = nik + X-l 
Sk = Sk + 25{k,i) 
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So, Ep^_^^pf) = Jq^ p1 Pn,T,x{dn) = where 

r(sT + i)...r(si + i) 



/(T-l) 

7(r-i) 



T{sT + ■■■ + si+T) 

r(sr + i)...r(si + 3)...r(.si + i) 



r{sT + ... + SI + T + 2) 
Therefore, for i = 1, . . . , T — 1, 

. 2^ r(5, + 3)rGsT + ... + SI+T) ^ (s, + l){Si + 2) 

""'"'^ ^''^ ^ r(., + i)r(.^ + . . . + .1 + T + 2) (xl^^ s, + r)(ELi + ^ + 1) 

{mi + X){mi + X + l) 



(rA + n)(rA + n + 1) 



Lemma 7.3. /f g = Z^j>o = Sfc=i "^1 '"^e have 

(20) V F 1 - 9 + ^(2A + l)+T(A^ + A) 

(20) E (Pfe) - (rA + n + l)(rA + n) 

Proof. We have 

^ ,2^ ^ (mfc + A)(mfc + A + l) 

2^ ^Pn,T,A IPfci A. + n) (TA + n + 1) 



□ 



k=l 



Z>-nl + n{2\+l) + T{X^ + X) 

{TX + n + l){TX + n) 
g + n(2A + l) + r(A^ + A) 
(TA + n + l)(rA + ra) 



□ 



8. APPENDIX 2: Some properties of the function defining A. 

Let u = U and v = V. Wc consider now u and v as free variables satisfying 
some requirements satisfied by the values that, in fact, U and V take on in 
our estimation, namely U = ^ and V = Vgt = Ej>i ^n(n-i)' • 
Also let q = vn{n — 1) + n = Ej>o J^'^j- Then 

Lemma 8.1. For every sample, U + V < 1 
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Proof. Since q = Zlj>oi^% ^^'^ ^ = J2j>oj'>T'j we have that 

U + V = — + -T < 1 

n n[n — 1) 

is impUed by 

TV N 

-nni+ni-q + r? = ni + (^jrij)'^ -^nj{jni + f) 

N N 

= X) /("j - nj) + ni+ ^ jnjrnr - ji 

j=l j,r=l,...,N,jj^r 3=1 

N 

(21) > {n\ — n\) +n\ + ni ^ jrij — > 

□ 

Lemma 8.2. For every sample, N + U — 1 — nU > 

Proof. By definition of U we have 

N + U-l-nU = N-ni + — -1, 

n 

then either n = ni = N and the right hand side becomes 0, or iV — ni > 1 
and the relation holds. □ 

Lemma 8.3. For every sample, {VnN — VN + N — n)n = qN — > 

Proof. Expressing q, N and n as function of the n/s we get 

N = E% 
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Then 

j>0 fc>0 j>0 

j>Okj^j 

= EE (-^^ ~ + k'^ - kj)njnk 

j>Oj<k 

= EE(j-^)'%"fc^o 

j>Oj<k 



□ 



Therefore, in the sequel we assume that u and v satisfy the following 
relations: 

(22) < < 1 

(23) < v<l 

(24) 1 > u + v 

(25) < N + u-l-nu 

(26) < vnN-vN + N-n 

Theorem 8.1. Let 

vn{n - 1) + n + n(2A + 1) + A(A + l)^^ 
[" + -^ i-« +l][n + A ] 

we /taue 

1 < A2 and 1 > A2, i.e. ''Zi+tn <u<l-v 

' —1 

2—v+vn 



A = arff inf 
A>1 



/(A) 



A2 i/ /?i < A2 and \2>l, i.e.l-Nv <u< ^"^^^^^^ 



00 i/ A2 < A, i.e. < II < 1 - A^i;. 
where Pi = is the largest singularity of f{X) and 

1 — u — v + uv — uvn 



A. 



2 



Nv+u-1 
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Proof. The equation /(A) = has solutions: 

, „, , — 2n + nu 
(27) Ai = <0 

1 — u — v + uv — uvn 

= Nv + u-l 

The root Ai is always non positive and thus it is not interesting and if 

1-u- v + uv -uvn 

l^yj ^2 = r-j — : : > 1, 

l\V + u — 1 

then A2 achieves the required minimum. 

To evaluate the other cases note that the function /(A) has two poles 

(30) ^1 = 

, , ^ n 1 — u 

* = -N- — 

and Ai < /32 < /3i. Moreover, 

,. .{u-l){vnN -vN + N -n). 

hm / A = 00 • sgn( ) = -00 

by ([22]) and ([26]), and 

(32) lim /(A) = ^X!i±^. 

A— >+c» iV 

We now verify that 

Lemma 8.4. /(A) is increasing for \> (3i. 
Proof. Let /'(A) = , • Then 

(33) hm g{\) = n(l - uf{vnN - vN + N - n) > 



by (j26j) . Note that g satisfies 

g'{X) = 2N^{N + u-l-nu)X 

+2nN{-l - n + 2N + u - Nu - Nv + uNv + Nuv - uNuv) 

with the leading coefficient nonnegative by (j25p . Therefore, if A > /3i = — 

g'{\) > 2nN{l - u){vnN -vN + N-n)>0 

again by 1^. Thus g' > for all X > f3i and, by ([MD, 5 > for all X > Pi 
and since the other factors in /' are also positive, we have that /' > for 
all A > /3i as required. □ 
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Now there are three possibihties. 



1. If n < 1 — Nv then from ()32p and the above Lemma, it follows that 
/ < for all A > /?! and increasing, thus 

A = arg inf |/(A)| = argmaxA > 1/ = +oo. 

1 

2. If 1 - Nv < u < then from ([29]) A2 > 1 is equivalent to u < "^^-i+tn ' 
in which case A = A2. 

3. If < u then A9 < 1 and by the Lemma above 

A = arg inf |/(A)[ = argmin A > 1/ = 1 

1 

The conditions on u and v are translated into those for A2 and /?i by 
direct calculation. 

□ 
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